Abstract: In this paper a stochastic generalisation of the standard Linde-Buzo-Gray
(LBG) approach to vector quantiser (VQ) design is presented, in which
the encoder is implemented as the sampling of a vector of code indices from a
probability distribution derived from the input vector, and the decoder is implemented
as a superposition of reconstruction vectors. The stochastic VQ is
optimised using a minimum mean Euclidean reconstruction distortion criterion,
as in the LBG case. Numerical simulations are used to demonstrate how this
leads to self-organisation of the stochastic VQ, where different stochastically 
sampled code indices become associated with different input subspaces. This 
property may be used to automate the process of splitting high-dimensional 
input vectors into low-dimensional blocks before encoding them. 

1 Introduction 

In vector quantisation a code book is used to encode each input vector as a 
corresponding code index, which is then decoded (again, using the codebook) 
to produce an approximate reconstruction of the original input vector [S] [2]. The
purpose of this paper is to generalise the standard approach to vector quantiser
(VQ) design [7], so that each input vector is encoded as a vector of code indices
that are stochastically sampled from a probability distribution that depends on 
the input vector, rather than as a single code index that is the deterministic 
outcome of finding which entry in a code book is closest to the input vector. 
This will be called a stochastic VQ (SVQ), and it includes the standard VQ 
as a special case. Note that this approach is different from the various soft 
competition and stochastic relaxation schemes that are used to train VQs (see 
e.g. [13]), because here the probability distribution is an essential part of the 
encoder, both during and after training. 

One advantage of using the stochastic approach, which will be demonstrated 
in this paper, is that it automates the process of splitting high-dimensional input 
vectors into low-dimensional blocks before encoding them, because minimising 
the mean Euclidean reconstruction error can encourage different stochastically
sampled code indices to become associated with different input subspaces [10].
Another advantage is that it is very easy to connect SVQs together, by using
the vector of code index probabilities computed by one SVQ as the input vector
to another SVQ [H].

In section 2 various pieces of previously published theory are unified to give
a coherent account of SVQs. In section 3 the results of some new numerical
simulations are presented, which demonstrate how the code indices in an SVQ
can become associated in various ways with input subspaces. 



2 Theory 

In this section various pieces of previously published theory are unified to es- 
tablish a coherent framework for modelling SVQs. 

In section 2.1 the basic theory of folded Markov chains (FMC) is given [8],
and in section 2.2 it is extended to the case of high-dimensional input data
[5]. In section 2.3 some properties of the solutions that emerge when the input
vector lives on a 2-torus are summarised [10]. Finally, in section 2.4 the theory
is further generalised to chains of linked FMCs.



2.1 Folded Markov Chains 

The basic building block of the encoder/decoder model used in this paper is the 
folded Markov chain (FMC) [8]. Thus an input vector x is encoded as a code
index vector y, which is then subsequently decoded as a reconstruction x' of 
the input vector. Both the encoding and decoding operations are allowed to be 
probabilistic, in the sense that y is a sample drawn from Pr(y|x), and x' is a 
sample drawn from Pr(x'|y), where Pr(y|x) and Pr(x|y) are Bayes' inverses 
of each other, as given by 

$$\Pr(x|y) = \frac{\Pr(y|x)\,\Pr(x)}{\int dz\,\Pr(y|z)\,\Pr(z)} \tag{1}$$

and Pr (x) is the prior probability from which x was sampled. 

Because the chain of dependences in passing from x to y and then to x' is 
first order Markov (i.e. it is described by the directed graph x → y → x'),
and because the two ends of this Markov chain (i.e. x and x') live in the same
vector space, it is called a folded Markov chain (FMC). The operations that
occur in an FMC are summarised in figure 1.

In order to ensure that the FMC encodes the input vector optimally, a mea- 
sure of the reconstruction error must be minimised. There are many possible 
ways to define this measure, but one that is consistent with many previous 
results, and which also leads to many new results, is the mean Euclidean recon- 
struction error measure D, which is defined as 

$$D \equiv \int dx\,\Pr(x) \sum_{y=1}^{M} \Pr(y|x) \int dx'\,\Pr(x'|y)\,\|x - x'\|^2 \tag{2}$$

Figure 1: A folded Markov chain (FMC) in which an input vector x is encoded
as a code index vector y that is drawn from a conditional probability Pr (y|x), 
which is then decoded as a reconstruction vector x' drawn from the Bayes' 
inverse conditional probability Pr (x'|y). 



where Pr(x) Pr(y|x) Pr(x'|y) is the joint probability that the FMC has state
(x, y, x'), $\|x - x'\|^2$ is the Euclidean reconstruction error, and $\int dx \sum_{y=1}^{M} \int dx'\,(\cdots)$
sums over all possible states of the FMC (weighted by the joint probability).
The code index vector y is assumed to lie on a rectangular lattice of size M.

The Bayes' inverse probability Pr(x'|y) may be integrated out of this ex- 
pression for D to yield 

$$D = 2 \int dx\,\Pr(x) \sum_{y=1}^{M} \Pr(y|x)\,\|x - x'(y)\|^2 \tag{3}$$

where the reconstruction vector x'(y) is defined as $x'(y) \equiv \int dx\,\Pr(x|y)\,x$.
Because of the quadratic form of the objective function, it turns out that x'(y)
may be treated as a free parameter whose optimum value (i.e. the solution of
$\partial D / \partial x'(y) = 0$) is $\int dx\,\Pr(x|y)\,x$, as required.

If D is now minimised with respect to the probabilistic encoder Pr (y|x) and 
the reconstruction vector x'(y), then the optimum has the form

$$\begin{aligned}
\Pr(y|x) &= \delta_{y,y(x)} \\
y(x) &= \arg\min_{y} \|x - x'(y)\|^2 \\
x'(y) &= \int dx\,\Pr(x|y)\,x
\end{aligned} \tag{4}$$

where Pr(y|x) has reduced to a deterministic encoder (as described by the Kronecker
delta $\delta_{y,y(x)}$), y(x) is a nearest neighbour encoding algorithm using code
vectors x'(y) to partition the input space into code cells, and (in an optimised
configuration) the x'(y) are the centroids of these code cells. This is equivalent
to a standard VQ [7].
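
The two conditions in equation 4 (nearest neighbour encoding and centroid reconstruction vectors) can be alternated to give an LBG-style iteration. The following is a minimal sketch in Python, not the implementation used in this paper; the function names, the list-of-lists data layout, and the choice to leave empty code cells unchanged are all assumptions made for illustration.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance ||a - b||^2."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest_index(x, codebook):
    """Nearest neighbour encoder y(x) = argmin_y ||x - x'(y)||^2."""
    return min(range(len(codebook)), key=lambda y: sq_dist(x, codebook[y]))

def lbg_step(data, codebook):
    """One iteration: assign each input to its code cell, then move each
    code vector to the centroid of its cell (empty cells are left alone,
    which is an assumption of this sketch)."""
    cells = [[] for _ in codebook]
    for x in data:
        cells[nearest_index(x, codebook)].append(x)
    new_codebook = []
    for code, cell in zip(codebook, cells):
        if cell:
            dim = len(code)
            new_codebook.append(
                [sum(x[d] for x in cell) / len(cell) for d in range(dim)])
        else:
            new_codebook.append(list(code))
    return new_codebook

def mean_distortion(data, codebook):
    """Mean squared Euclidean reconstruction error of the deterministic VQ."""
    return sum(sq_dist(x, codebook[nearest_index(x, codebook)])
               for x in data) / len(data)

if __name__ == "__main__":
    rng = random.Random(0)
    data = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
    codebook = [list(x) for x in data[:4]]   # initialise from the data
    for _ in range(10):
        codebook = lbg_step(data, codebook)
    print(mean_distortion(data, codebook))
```

Each call to `lbg_step` cannot increase `mean_distortion` on the same data, because the assignment step and the centroid step each minimise the distortion with the other held fixed.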

An extension of the standard VQ to the case where the code index is
transmitted along a noisy communication channel before the reconstruction is
attempted [SI [T] was derived in [5], and was shown to lead to a good approximation
to a topographic mapping neural network |5J.

2.2 High Dimensional Input Spaces

A problem with the standard VQ is that its code book grows exponentially 
in size as the dimensionality of the input vector is increased, assuming that 
the contribution to the reconstruction error from each input dimension is held 
constant. This means that such VQs are useless for encoding extremely high 
dimensional input vectors, such as images. The usual solution to this problem 
is to partition the input space into a number of lower dimensional subspaces 
(or blocks), and then to encode each of these subspaces separately. However, 
this produces an undesirable side effect, where the boundaries of the blocks are 
clearly visible in the reconstruction; this is the origin of the blocky appearance 
of reconstructed images, for instance. 

There is also a much more fundamental objection to this type of partition- 
ing, because the choice of blocks should ideally be decided in such a way that 
the correlations within a block are much stronger than the correlations between 
blocks. In the case of image encoding, this will usually hold if each block
consists of contiguous image pixels. However, more
generally, there is no guarantee that the input vector statistics will respect the 
partitioning in this convenient way. Thus it will be necessary to deduce the best 
choice of blocks from the training data. 

In order to solve the problem of finding the best partitioning, consider the 
following constraint on Pr (y|x) and x' (y) in the FMC objective function in 
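equation 3:

$$\begin{aligned}
y &= (y_1, y_2, \cdots, y_n), \quad 1 \le y_i \le M \\
\Pr(y|x) &= \Pr(y_1|x)\,\Pr(y_2|x) \cdots \Pr(y_n|x) \\
x'(y) &= \frac{1}{n} \sum_{i=1}^{n} x'(y_i)
\end{aligned} \tag{5}$$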

Thus the code index vector y is assumed to be n-dimensional, each component
$y_i$ (for i = 1, 2, ..., n and $1 \le y_i \le M$) is an independent sample drawn from
Pr(y|x), and the reconstruction vector x'(y) (vector argument) is assumed to be
a superposition of n contributions $x'(y_i)$ (scalar argument) for i = 1, 2, ..., n.
As D is minimised, this constraint allows partitioned solutions to emerge by a 
process of self-organisation. 
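
The constrained encoder and decoder of equation 5 can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the example probabilities and reconstruction vectors are hypothetical, and code indices are 0-based here whereas the text uses 1 to M.

```python
import random

def sample_code_indices(probs, n, rng):
    """Draw n independent code indices y_1..y_n from Pr(y|x).
    probs[y] holds Pr(y|x) for y = 0..M-1."""
    return rng.choices(range(len(probs)), weights=probs, k=n)

def reconstruct(indices, recon_vectors):
    """x'(y) = (1/n) * sum_i x'(y_i): the average of the reconstruction
    vectors selected by the sampled code indices."""
    n = len(indices)
    dim = len(recon_vectors[0])
    return [sum(recon_vectors[y][d] for y in indices) / n for d in range(dim)]

if __name__ == "__main__":
    rng = random.Random(0)
    probs = [0.5, 0.5, 0.0]                      # hypothetical Pr(y|x) for one x
    recon = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]  # hypothetical x'(y)
    ys = sample_code_indices(probs, n=20, rng=rng)
    print(ys, reconstruct(ys, recon))
```

With the hypothetical values above, the reconstruction is a mixture of the first two reconstruction vectors, whose mixing proportions fluctuate from one set of n samples to the next.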

For instance, solutions can have the structure illustrated in figure 2. This
type of structure is summarised as follows

$$\begin{aligned}
x &= (x_1, x_2, \cdots, x_N) \\
\Pr(y|x) &= \Pr(y_1|x_{p(y_1)})\,\Pr(y_2|x_{p(y_2)}) \cdots \Pr(y_n|x_{p(y_n)})
\end{aligned} \tag{6}$$

In this type of solution the input vector x is partitioned as $(x_1, x_2, \cdots, x_N)$, and
the probability Pr(y|x) reduces to $\Pr(y|x_{p(y)})$, which depends only on $x_{p(y)}$,
where the function p(y) computes the index of the block which code index y
inhabits. There is not an exact correspondence between this type of partitioning
and that used in the standard approach to encoding image blocks, because here

Figure 2: A typical solution in which the components of the input vector x
and the range of values of the output code index y are both partitioned into 
blocks. These input and output blocks are connected together as illustrated. 
More generally, the blocks may overlap each other. 



the n code indices are spread at random over the N image blocks, which does
not guarantee that every block is encoded. However, for a given N, if n is
chosen to be sufficiently large, then it is virtually certain that every block
is encoded. This is the price that has to be paid when it is assumed that the
code indices are drawn independently.
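
To give a feel for how large n needs to be, the probability that every block receives at least one code index can be computed by inclusion-exclusion. The sketch below assumes, purely for illustration, that each of the n code indices lands in one of the N blocks independently and with equal probability; in a trained SVQ the block probabilities depend on the input and need not be uniform. The function name is hypothetical.

```python
from math import comb

def prob_all_blocks_encoded(num_blocks, n):
    """Probability that n independent, uniformly placed code indices hit
    every one of num_blocks blocks at least once (inclusion-exclusion)."""
    N = num_blocks
    return sum((-1) ** k * comb(N, k) * (1 - k / N) ** n
               for k in range(N + 1))

if __name__ == "__main__":
    for n in (2, 5, 10, 20):
        print(n, prob_all_blocks_encoded(2, n))
```

For N = 2 the expression reduces to 1 − 2^(1−n), so under this uniform assumption the probability of missing a block falls off exponentially as n increases.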

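The constraints in equation 5 prevent the full space of possible values of
Pr(y|x) or x'(y) from being explored as D is minimised, so they lead to an
upper bound $D_1 + D_2$ on the FMC objective function D (i.e. $D \le D_1 + D_2$),
which may be derived as [S]

$$\begin{aligned}
D_1 &\equiv \frac{2}{n} \int dx\,\Pr(x) \sum_{y=1}^{M} \Pr(y|x)\,\|x - x'(y)\|^2 \\
D_2 &\equiv \frac{2(n-1)}{n} \int dx\,\Pr(x)\,\Bigl\|x - \sum_{y=1}^{M} \Pr(y|x)\,x'(y)\Bigr\|^2
\end{aligned} \tag{7}$$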



Note that M and n are model order parameters, whose values need to be chosen 
appropriately for each encoder optimisation problem. 
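
The split into D1 and D2 can be checked numerically for a single input vector. The sketch below assumes that D1 carries the prefactor 2/n and D2 the prefactor 2(n − 1)/n, which is what results from averaging the squared error over n independent samples; it enumerates all M^n code index vectors for a small hypothetical example and compares the result with the D1 + D2 integrand. All names and numerical values are illustrative.

```python
from itertools import product

def sq_dist(a, b):
    """Squared Euclidean distance ||a - b||^2."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def exact_distortion(x, probs, recon, n):
    """2 * E ||x - (1/n) sum_i x'(y_i)||^2, summed over all M^n index vectors
    with y_1..y_n drawn independently from probs."""
    M, dim = len(probs), len(x)
    total = 0.0
    for ys in product(range(M), repeat=n):
        weight = 1.0
        for y in ys:
            weight *= probs[y]
        mean = [sum(recon[y][d] for y in ys) / n for d in range(dim)]
        total += weight * sq_dist(x, mean)
    return 2.0 * total

def d1_plus_d2(x, probs, recon, n):
    """Integrand of D1 + D2 for one input x (assumed prefactors, see text)."""
    M, dim = len(probs), len(x)
    d1 = (2.0 / n) * sum(probs[y] * sq_dist(x, recon[y]) for y in range(M))
    mean = [sum(probs[y] * recon[y][d] for y in range(M)) for d in range(dim)]
    d2 = (2.0 * (n - 1) / n) * sq_dist(x, mean)
    return d1 + d2

if __name__ == "__main__":
    x = [0.3, -0.2]
    probs = [0.2, 0.5, 0.3]
    recon = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
    for n in (1, 2, 3):
        print(n, exact_distortion(x, probs, recon, n),
              d1_plus_d2(x, probs, recon, n))
```

The two columns agree for every n, and for n = 1 the D2 contribution vanishes, which matches the statement that only D1 contributes in that case.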

For n = 1 only the $D_1$ term contributes, and it is equivalent to the FMC
objective function D in equation 3 with the vector code index y replaced by a
scalar code index y, so its minimisation leads to a standard vector quantiser as in
equation 4, in which each input vector is approximated by a single reconstruction
vector x'(y).

When n becomes large enough that $D_2$ dominates over $D_1$, the optimisation
problem reduces to minimisation of the mean Euclidean reconstruction error
(approximately). This encourages the approximation $x \approx \sum_{y=1}^{M} \Pr(y|x)\,x'(y)$
to hold, in which the input vector x is approximated as a weighted (using weights
Pr(y|x)) sum of many reconstruction vectors x'(y). In numerical simulations
this has invariably led to solutions which are a type of principal components
analysis (PCA) of the input vectors, where the expansion coefficients Pr(y|x)
are constrained to be non-negative and sum to unity. Also, the approximation
$x \approx \sum_{y=1}^{M} \Pr(y|x)\,x'(y)$ is very good for solutions in which Pr(y|x) depends
on the whole of x, rather than merely on a subspace of x, so this does not lead
to a partitioned solution.

For intermediate values of n, where both $D_1$ and $D_2$ are comparable in
size, partitioned solutions can emerge. However, in this intermediate region the
properties of the optimum solution depend critically on the interplay between
the statistical properties of the training data and the model order parameters
M and n. To illustrate how the choice of M and n affects the solution, the case
of input vectors that live on a 2-torus is summarised in section 2.3.

When the full FMC objective function in equation 3 is optimised it leads to
the standard (deterministic) VQ in equation 4. However, it turns out that the
constrained FMC objective function $D_1 + D_2$ in equation 7 does not allow a
deterministic VQ to emerge (except in the case n = 1), because a more accurate
reconstruction can be obtained by allowing more than one code index to be
sampled for each input vector. Because of this behaviour, in which the encoder
is stochastic both during and after training, this type of constrained FMC will
be called an SVQ.

2.3 Example: 2-Torus Case 

A scene is defined as a number of objects at specified positions and orientations,
so it is specified by a low-dimensional vector of scene coordinates, which are
effectively the intrinsic coordinates of a low-dimensional manifold. An image 
of that scene is an embedding of the low-dimensional manifold in the high- 
dimensional space of image pixel values. Because the image pixel values are 
non-linearly related to the vector of scene coordinates, this embedding operation 
distorts the manifold so that it becomes curved. The problem of finding the 
optimal way to encode images may thus be viewed as the problem of finding the 
optimal way to encode curved manifolds, where the intrinsic dimensionality of
the manifold is the same as the dimensionality of the vector of scene coordinates. 

The simplest curved manifold is the circle, and the next most simple curved 
manifold is the 2-torus (which has 2 intrinsic circular coordinates). By making 
extensive use of the fact that the optimal form of Pr(y|x) must be a piecewise
linear function of the input vector x [11], the optimal encoders for these
manifolds have been derived analytically [10]. The toroidal case is very interesting
because it demonstrates the transition between the unpartitioned and
partitioned optimum solutions as n is increased. A 2-torus is a realistic model 
of the manifold generated by the linear superposition of 2 sine waves, having 
fixed amplitudes and wavenumbers. The phases of the 2 sine waves are then 
the 2 intrinsic circular coordinates of this manifold, and if these phases are both 
uniformly distributed on the circle, then Pr(x) defines a constant probability
density on the 2-torus. 

A typical Pr(y|x) for small n is illustrated in figure 3. Only in the case
where n = 1 would Pr(y|x) correspond to a sharply defined code cell; for n > 1
the edges of the code cells are tapered so that they overlap with one another.
The 2-torus is covered with a large number of these overlapping code cells, and

Figure 3: A typical probability Pr(y|x) (only one y value is illustrated) for
encoding a 2-torus using a small value of n. This defines a smoothly tapered 
localised region on the 2-torus. The toroidal mesh serves only to help visualise 
the 2-torus. 




Figure 4: Two typical probabilities $\Pr(y_1|x)$ and $\Pr(y_2|x)$ for encoding a 2-torus
using a large value of n. Each separately defines a smoothly tapered
collar-shaped region on the 2-torus. However, taken together, their region of 
intersection defines a smoothly tapered localised region on the 2-torus. The 
toroidal mesh serves only to help visualise the 2-torus. 

when a code index y is sampled from Pr(y|x), it allows a reconstruction of the
input to be made to within an uncertainty area commensurate with the size of a 
code cell. This type of encoding is called joint encoding, because the 2 intrinsic 
dimensions of the 2-torus are simultaneously encoded by y. 

A typical pair of Pr(y|x) (i.e. $\Pr(y_1|x)$ and $\Pr(y_2|x)$) for large n is illustrated
in figure 4. The partitioning splits the code indices into two different
types: one type encodes one of the intrinsic dimensions of the 2-torus, and the
other type encodes the other intrinsic dimension of the 2-torus, so each code
cell is a tapered collar-shaped region. When the code indices $(y_1, y_2, \cdots, y_n)$ are
sampled from Pr(y|x), they allow a reconstruction of the input to be made to
within an uncertainty area commensurate with the size of the region of intersection
of a pair of orthogonal code cells, as illustrated in figure 4. This type of
encoding is called factorial encoding, because the 2 intrinsic dimensions of the
2-torus are separately encoded in $(y_1, y_2, \cdots, y_n)$.

For a 2-torus there is an upper limit M ≈ 12 beyond which the optimum

Figure 5: A chain of linked FMCs, in which the output from each stage is its
vector of posterior probabilities (for all values of the code index), which is then
used as the input to the next stage. Only 3 stages are shown, but any number 
may be used. More generally, any acyclically linked network of FMCs may be 
used. 

solution is always a joint encoder (as shown in figure 3). This limit arises
because when M > 12 the code book is sufficiently large that the joint encoder 
gives the best reconstruction for all values of n. This result critically depends 
on the fact that as n is increased the code cells overlap progressively more and 
more, so the accuracy of the reconstruction progressively increases. For M > 12 
the rate at which the accuracy of the joint encoder improves (as n increases) 
is sufficiently great that it is always better than that of the factorial encoder
(which also improves as n increases). 

2.4 Chains of Linked FMCs 

Thus far it has been shown that the FMC objective function in equation 3,
with the constraints imposed in equation 5, leads to useful properties, such as
the automatic partitioning of the code book to yield the factorial encoder, such
as that illustrated in figure 4 (and more generally, as illustrated in figure 2).
The free parameters (M, n) (i.e. the size of the code book, and the number of
code indices sampled) can be adjusted to obtain an optimal solution that has
the desired properties (e.g. a joint or a factorial encoder, as in figures 3 and 4
respectively). However, since there are only 2 free parameters, there is a limit to 
the variety of types of properties that the optimal solution can have. It would 
thus be very useful to introduce more free parameters. 

The FMC illustrated in figure 1 may be generalised to a chain of linked FMCs
as shown in figure 5. Each stage in this chain is an FMC of the type shown
in figure 1, and the vector of probabilities (for all values of the code index)
computed by each stage is used as the input vector to the next stage; there 
are other ways of linking the stages together, but this is the simplest possibility. 
The overall objective function is a weighted sum of the FMC objective functions 
derived from each stage. The total number of free parameters in an L stage chain
is 3L − 1, which is the sum of 2 free parameters for each of the L stages, plus
L − 1 weighting coefficients; there are L − 1 rather than L weighting coefficients
because the overall normalisation of the objective function does not affect the 
optimum solution. 

The chain of linked FMCs may be expressed mathematically by first of all
introducing an index l to allow different stages of the chain to be distinguished
thus

$$\begin{aligned}
M \to M^{(l)}, &\quad y \to y^{(l)} \\
x \to x^{(l)}, &\quad x' \to x'^{(l)} \\
n \to n^{(l)}, &\quad D \to D^{(l)} \\
D_1 \to D_1^{(l)}, &\quad D_2 \to D_2^{(l)}
\end{aligned} \tag{8}$$

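The stages are then defined and linked together thus (the details are given
only as far as the input to the third stage)

$$\begin{aligned}
& x^{(1)} \to y^{(1)} \to x'^{(1)} \\
& x^{(2)} = \bigl(x^{(2)}_1, x^{(2)}_2, \cdots, x^{(2)}_{M^{(1)}}\bigr), \quad x^{(2)}_i = \Pr\bigl(y^{(1)} = i \,\big|\, x^{(1)}\bigr), \quad 1 \le i \le M^{(1)} \\
& x^{(2)} \to y^{(2)} \to x'^{(2)} \\
& x^{(3)} = \bigl(x^{(3)}_1, x^{(3)}_2, \cdots, x^{(3)}_{M^{(2)}}\bigr), \quad x^{(3)}_i = \Pr\bigl(y^{(2)} = i \,\big|\, x^{(2)}\bigr), \quad 1 \le i \le M^{(2)}
\end{aligned} \tag{9}$$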



The objective function and its upper bound are then given by 

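$$\begin{aligned}
D &= \sum_{l=1}^{L} s^{(l)} D^{(l)} \\
&\le D_1 + D_2 \\
&= \sum_{l=1}^{L} s^{(l)} \bigl(D_1^{(l)} + D_2^{(l)}\bigr)
\end{aligned} \tag{10}$$

where $s^{(l)} \ge 0$ is the weighting that is applied to the contribution of stage l of
the chain to the overall objective function.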

The piecewise linearity property enjoyed by Pr(y|x) in a single stage chain
also holds for all of the $\Pr(y^{(l)}|x^{(l)})$ in a multi-stage chain, provided that the
stages are linked together as prescribed in equation 9 [11]. This will allow
optimum analytic solutions to be derived by an extension of the single stage
methods used in [10].
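
The way in which the stages of a chain are linked can be sketched as a forward pass in which each stage outputs its full vector of code index probabilities. The sketch below assumes the normalised sigmoid form of Pr(y|x) that is used for training in section 3.1; the function names and the example weights and biases are hypothetical, and no sampling or reconstruction is performed.

```python
import math

def posterior(x, weights, biases):
    """Pr(y|x) = Q(y|x) / sum_y' Q(y'|x), where
    Q(y|x) = 1 / (1 + exp(-w(y).x - b(y))) (assumed model, see text)."""
    q = []
    for w, b in zip(weights, biases):
        activation = sum(wi * xi for wi, xi in zip(w, x)) + b
        q.append(1.0 / (1.0 + math.exp(-activation)))
    total = sum(q)
    return [qi / total for qi in q]

def chain_forward(x, stages):
    """Propagate x through a chain of stages, each given as (weights, biases).
    Returns [x^(1), x^(2), ..., x^(L+1)], where x^(l+1) is the vector of
    code index probabilities computed by stage l."""
    inputs = [list(x)]
    for weights, biases in stages:
        inputs.append(posterior(inputs[-1], weights, biases))
    return inputs

if __name__ == "__main__":
    # Hypothetical 2-stage chain: 2 inputs -> M(1) = 3 -> M(2) = 2.
    stage1 = ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, -1.0])
    stage2 = ([[1.0, -1.0, 0.0], [0.0, 1.0, -1.0]], [0.0, 0.0])
    for vec in chain_forward([0.5, -0.5], [stage1, stage2]):
        print(vec)
```

The input dimensionality of each stage after the first is fixed by the code book size of the stage before it, which is why the weight vectors of the second stage above have 3 components.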



3 Simulations 

In this section the results of various numerical simulations are presented, which 
demonstrate some of the types of behaviour exhibited by an encoder that con- 
sists of a chain of linked FMCs. Synthetic, rather than real, training data are 
used in all of the simulations, because this allows the basic types of behaviour 
to be cleanly demonstrated.

In section 3.1 the training algorithm is presented. In section 3.2 the training
data is described. In section 3.3 a single stage encoder is trained on data that
is a superposition of two randomly positioned objects. In section 3.4 this is
generalised to objects with correlated positions, and three different types of behaviour
are demonstrated: factorial encoding using both a 1-stage and a 2-stage
encoder (section 3.4.1), joint encoding using a 1-stage encoder (section 3.4.2),
and invariant encoding (i.e. ignoring a subspace of the input space altogether)
using a 2-stage encoder (section 3.4.3).



3.1 Training Algorithm 

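Assuming that Pr(y|x) is modelled as in appendix A (i.e. $\Pr(y|x) = Q(y|x) / \sum_{y'=1}^{M} Q(y'|x)$
and $Q(y|x) = 1 / (1 + \exp(-w(y) \cdot x - b(y)))$), then the partial derivatives of $D_1 + D_2$ with
respect to the 3 types of parameters in a single stage of the encoder may be
denoted as

$$\begin{aligned}
g_w(y) &\equiv \frac{\partial (D_1 + D_2)}{\partial w(y)} \\
g_b(y) &\equiv \frac{\partial (D_1 + D_2)}{\partial b(y)} \\
g_x(y) &\equiv \frac{\partial (D_1 + D_2)}{\partial x'(y)}
\end{aligned} \tag{11}$$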

This may be generalised to each stage of a multi-stage encoder by including an
(l) superscript, and ensuring that for each stage the partial derivatives include
the additional contributions that arise from forward propagation through later
stages; this is essentially an application of the chain rule of differentiation, using
the derivatives $\partial x^{(l+1)} / \partial w^{(l)}(y^{(l)})$ and $\partial x^{(l+1)} / \partial b^{(l)}(y^{(l)})$ to link the stages together (see appendix
A).

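A simple algorithm for updating these parameters is (omitting the (l) superscript,
for clarity)

$$\begin{aligned}
w(y) &\to w(y) - \varepsilon\,\frac{g_w(y)}{g_{w,0}} \\
b(y) &\to b(y) - \varepsilon\,\frac{g_b(y)}{g_{b,0}} \\
x'(y) &\to x'(y) - \varepsilon\,\frac{g_x(y)}{g_{x,0}}
\end{aligned} \tag{12}$$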

where ε is a small update step size parameter, and the three normalisation
factors are defined as

[definitions of $g_{w,0}$, $g_{b,0}$ and $g_{x,0}$ missing from the source]

The $g_{w,0}$ and $g_{x,0}$ factors ensure that the maximum update step size for w(y)
and x'(y) is ε dim x (i.e. ε per dimension), and the $g_{b,0}$ factor ensures that
the maximum update step size for b(y) is ε. This update algorithm can be
generalised to use a different ε for each stage of the encoder, and also to allow a
different ε to be used for each of the 3 types of parameter. Furthermore, the size
of ε can be varied as training proceeds, usually starting with a large value, and
then gradually reducing its size as the solution converges. It is not possible to
give general rules for exactly how to do this, because training conditions depend
very much on the statistical properties of the training set.
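
A normalised update of this general kind can be sketched as follows. The normalisation used here is an assumption made only for illustration, and is not claimed to be the definition used in this paper: each gradient is divided by the largest absolute gradient component found over all code indices, so that no parameter component moves by more than the step size. The function name and data layout are also hypothetical.

```python
def normalised_update(params, grads, step):
    """Move each parameter vector against its gradient.

    params[y] and grads[y] are the parameter vector and gradient for code
    index y. ASSUMPTION: the normalisation factor is the largest absolute
    gradient component over all y, so every component moves by at most
    `step`. The paper's own normalisation factors may differ."""
    g0 = max(abs(g) for grad in grads for g in grad)
    if g0 == 0.0:
        return [list(p) for p in params]      # zero gradient: nothing to do
    return [[p - step * g / g0 for p, g in zip(param, grad)]
            for param, grad in zip(params, grads)]

if __name__ == "__main__":
    w = [[0.0, 0.0], [1.0, 1.0]]              # hypothetical w(y) for M = 2
    g_w = [[2.0, -4.0], [1.0, 0.0]]           # hypothetical gradients
    print(normalised_update(w, g_w, step=0.2))
```

The same function can be applied separately to w(y), b(y) (as vectors of length 1) and x'(y), each with its own step size, which mirrors the option of using a different step size for each type of parameter.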



3.2 Training Data 

The key property that this type of self-organising encoder exhibits is its ability 
to automatically split up high-dimensional input spaces into lower-dimensional 
subspaces, each of which is separately encoded. For instance, see section 2.3 for
a summary of the analytically solved case of training data that lives on a simple
curved manifold (i.e. a 2-torus). This self-organisation manifests itself in many
different ways, depending on the interplay between the statistical properties of
the training data, and the 3 free parameters (i.e. the code book size M, the
number of code indices sampled n, and the stage weighting s) per stage of the
encoder (see section 2.4). However, it turns out that the joint and factorial
encoders (of the same general type as those obtained in the case of a 2-torus) 
are also the optimum solutions for more general curved manifolds. 

In order to demonstrate the various different basic types of self-organisation 
it is necessary to use synthetic training data with controlled properties. All of 
the types of self-organisation that will be demonstrated in this paper may be 
obtained by training a 1-stage or 2-stage encoder on 24-dimensional data (i.e.
M = 24) that consists of a superposition of a pair of identical objects (with
circular wraparound to remove edge effects), such as is shown in figure 6.

The training data is thus uniformly distributed on a manifold with 2 intrinsic 
circular coordinates, which is then embedded in a 24-dimensional image space. 
The embedding is a curved manifold, but is not a 2-torus, and there are two 
reasons for this. Firstly, even though the manifold has 2 intrinsic circular co- 
ordinates, the non-linear embedding distorts these circles in the 24-dimensional 
embedding space so that they are not planar (i.e. the profile of each object lives 
in the full 24-dimensional embedding space). Secondly, unlike a 2-torus, each 
point on the manifold maps to itself under interchange of the pair of circular 
coordinates, so the manifold is covered twice by a 2-torus (i.e. the objects are 
identical, so it makes no difference if they are swapped over). However, these

Figure 6: An example of a typical training vector for M = 24. Each object is
a Gaussian hump with a half-width of 1.5 units, and peak amplitude of 1. The
overall input vector is formed as a linear superposition of the 2 objects. Note 
that the input vector is wrapped around circularly to remove minor edge effects 
that would otherwise arise. 

differences do not destroy the general character of the joint and factorial encoder 
solutions that were obtained in section 2.3.

In the simulations presented below, two different methods of selecting the 
object positions are used: either the positions are statistically independent, or 
they are correlated. In the independent case, each object position is a random 
integer in the interval [1,24]. In the correlated case, the first object position is 
a random integer in the interval [1, 24], and the second object position is chosen 
relative to the first one as an integer in the range [4, 8], so that the mean object 
separation is 6 units. 
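
Training vectors of this type can be generated with a short script. The sketch below is illustrative only: it assumes that the quoted half-width of 1.5 units is the standard deviation of the Gaussian hump, that the offset in the correlated case is drawn uniformly from the integers 4 to 8, and that positions are mapped onto 0-based array indices modulo 24. The function and constant names are hypothetical.

```python
import math
import random

DIM = 24      # dimensionality of each training vector
SIGMA = 1.5   # ASSUMPTION: the quoted half-width is used as the Gaussian sigma

def hump(centre, dim=DIM, sigma=SIGMA):
    """Unit-amplitude Gaussian hump centred at index `centre`, with
    circular wraparound."""
    out = []
    for i in range(dim):
        d = abs(i - centre) % dim
        d = min(d, dim - d)            # shortest distance around the circle
        out.append(math.exp(-d * d / (2.0 * sigma * sigma)))
    return out

def training_vector(rng, correlated=False, dim=DIM):
    """Linear superposition of two identical humps whose positions are
    either independent or correlated (separation between 4 and 8)."""
    p1 = rng.randint(1, dim)
    if correlated:
        p2 = p1 + rng.randint(4, 8)    # mean separation of 6 units
    else:
        p2 = rng.randint(1, dim)
    h1, h2 = hump(p1 % dim), hump(p2 % dim)
    return [a + b for a, b in zip(h1, h2)]

if __name__ == "__main__":
    rng = random.Random(0)
    print([round(v, 2) for v in training_vector(rng, correlated=True)])
```

In the independent case the two humps can coincide, so the peak value of a training vector ranges up to 2, whereas in the correlated case the minimum separation of 4 units keeps the humps apart.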

3.3 Independent Objects 

The simplest demonstration is to let a single stage encoder discover the fact 
that the training data consists of a superposition of a pair of objects, which is 
an example of independent component analysis (ICA) or blind signal separation 
(BSS) [4]. This may readily be done by setting the parameter values as follows:
code book size M = 16, number of code indices sampled n = 20, update step size ε = 0.2 for 250
training steps, ε = 0.1 for a further 250 training steps.

The self-organisation of the 16 reconstruction vectors as training progresses 
(measured down the page) is shown in figure 7.

After some initial confusion, the reconstruction vectors self-organise so that 
each code index corresponds to a single object at a well defined location. This 
behaviour is non-trivial, because each training vector is a superposition of a pair 
of objects at independent locations, so typically more than one code index must 
be sampled by the encoder, which is made possible by the relatively large choice 
n = 20. This result is a factorial encoder, because the objects are encoded 
separately. This is a rudimentary example of the type of solution that was 
illustrated in figure 2, although here the blocks overlap each other.

The case of a joint encoder requires a rather large code book when the 
objects are independent. However, when correlations between the objects are 
introduced then the code book can be reduced to a manageable size, as will be

Figure 7: A factorial encoder emerges when a single stage encoder is trained on
data that is a superposition of 2 objects in independent locations. 

demonstrated in the next section.

3.4 Correlated Objects

A more interesting situation arises if the positions of the pair of objects are 
mutually correlated, so that the training data is non-uniformly distributed on a 
manifold with 2 intrinsic circular coordinates. The pair of objects can then be 
encoded in 3 fundamentally different ways: 

1. Factorial encoder. This encoder ignores the correlations between the ob- 
jects, and encodes them as if they were 2 independent objects. Each 
code index would thus encode a single object position, so many code in- 
dices must be sampled in order to virtually guarantee that both object 
positions are encoded. This result is a type of independent component
analysis (ICA) [4].

2. Joint encoder. This encoder regards each possible joint placement of the 
2 objects as a distinct configuration. Each code index would thus encode 
a pair of object positions, so only one code index needs to be sampled in 
order to guarantee that both object positions are encoded. This result is 
basically the same as what would be obtained by using a standard VQ [7] . 

3. Invariant encoder. This encoder regards each possible placement of the 
centroid of the 2 objects as a distinct configuration, but regards all possible 
object separations (for a given centroid) as being equivalent. Each code 
index would thus encode only the centroid of the pair of objects. This 
type of encoder does not arise when the objects are independent. This is
similar to previously described self-organising transformation invariant detectors.

Each of these 3 possibilities is shown in figure 8, where the diagrams are meant
only to be illustrative. The correlated variables live in the large 2-dimensional
rectangular region extending from bottom-left to top-right of each diagram. For
data of the type shown in figure 6, the rectangular region is in reality the curved
manifold generated by varying the pair of object coordinates, and the invariance
of the data under interchange of the pair of object coordinates means that the
upper left and lower right halves of each diagram cover the manifold twice. 

The factorial encoder has two orthogonal sets of long thin rectangular code 
cells, and the diagram shows how a pair of such cells intersect to define a small 
square code cell. The joint encoder behaves as a standard vector quantiser, and 
is illustrated as having a set of square code cells, although their shapes will not 
be as simple as this in practice. The invariant encoder has a set of long thin 
rectangular code cells that encode only the long diagonal dimension. 

In all 3 cases there is overlap between code cells. In the case of the factorial 
and joint encoders the overlap tends to be only between nearby code cells, 
whereas in the case of an invariant encoder the range of the overlap is usually 
much greater, as will be seen in the numerical simulations below. In practice 
the optimum encoder may not be a clean example of one of the types illustrated 
in figure 8, as will also be seen in the numerical simulations below.

3.4.1 Factorial Encoding 

A factorial encoder may be trained by setting the parameter values as follows: 
code book size M = 16, number of code indices sampled n = 20, update step size ε = 0.2 for 500
training steps, ε = 0.1 for a further 500 training steps. This is the same as in
the case of independent objects, except that the number of training steps has 
been doubled. 

The result is shown in figure 9, which should be compared with the result 
for independent objects in figure [T]. The presence of correlations degrades the 
quality of this factorial code relative to the case of independent objects. The 
contamination of the factorial code takes the form of a few code indices which 
respond jointly to the pair of objects. 

The joint coding contamination of the factorial code can be reduced by using 
a 2-stage encoder, in which the second stage has the same values of M and n 
as the first stage (although identical parameter values are not necessary), and 
(in this case) both stages have the same weighting in the objective function (see 
equation 10). 

The results are shown in figure 10. The reason that the second stage encourages 
the first to adopt a pure factorial code is quite subtle. The result shown in 
figure 10 will lead to the first stage producing an output in which 2 code indices 
(one for each object) typically have probability 1/2 of being sampled, and all of the 
remaining code indices have a very small probability (this is an approximation 
which ignores the fact that the code cells overlap). On the other hand, figure 9 
will lead to an output in which the probability is sometimes concentrated on a 
single code index. However, the contribution of the second stage to the overall 
objective function encourages it to encode the vector of probabilities output by 

Figure 8: Three alternative ways of using 30 code indices to encode a pair of 
correlated variables. The typical code cells are shown in bold. 

Figure 9: A factorial encoder emerges when a single stage encoder is trained on 
data that is a superposition of 2 objects in correlated locations. 






Figure 10: The factorial encoder is improved, by the removal of the joint en- 
coding contamination, when a 2-stage encoder is used. 

Figure 11: A joint encoder emerges when a single stage encoder is trained on 
data that is a superposition of 2 objects in correlated locations. 

the first stage with minimum Euclidean reconstruction error, which is easier to 
do if the situation is as in figure 10 rather than as in figure 9. In effect, the 
second stage likes to see an output from the first stage in which a large number 
of code indices are each sampled with a low probability, which favours factorial 
coding over joint encoding. 

3.4.2 Joint Encoding 

A joint encoder may be trained by setting the parameter values as follows: code 
book size M = 16, number of code indices sampled n = 3, e = 0.2 for 500 
training steps, e = 0.1 for a further 500 training steps, e = 0.05 for a further 
1000 training steps. This is the same as the parameter values for the factorial 
encoder above, except that n has been reduced to n = 3, and the training 
schedule has been extended. 

The result is shown in figure 11. After some initial confusion, the reconstruction 
vectors self-organise so that each code index corresponds to a pair of 
objects at well defined locations, so the code index jointly encodes the pair of 
object positions; this is a joint encoder. The small value of n prevents a factorial 
encoder from emerging. 

3.4.3 Invariant Encoding 

An invariant encoder may be trained by using a 2-stage encoder, and setting 
the parameter values identically in each stage as follows (where the weighting 
of the second stage relative to the first is denoted as s): code book size M = 16, 
number of code indices sampled n = 3, e = 0.2 and s = 5 for 500 training steps, 
e = 0.1 and s = 10 for a further 500 training steps, e = 0.05 and s = 20 for a 
further 500 training steps, e = 0.05 and s = 40 for a further 500 training steps. 
This is basically the same as the parameter values used for the joint encoder 
above, except that there are now 2 stages, and the weighting of the second stage 

Figure 12: An invariant encoder emerges when a 2-stage encoder is trained on 
data that is a superposition of 2 objects in correlated locations. 

is progressively increased throughout the training schedule. Note that the large 
value that is used for s is offset to a certain extent by the fact that the ratio 
of the normalisation of the inputs to the first and second stages is very large; 
the anomalous normalisation of the input to the first stage could be removed by 
insisting that the input to the first stage is a vector of probabilities, but that is 
not done in these simulations. 
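
The training schedule for the invariant encoder can be written down as a small table of phases. The sketch below only encodes the numbers quoted above; the names `Phase`, `INVARIANT_SCHEDULE` and `iterate_schedule` are hypothetical, and reading e as an update step size is an assumption, since this section does not define it.

```python
from typing import Iterator, List, NamedTuple, Tuple


class Phase(NamedTuple):
    steps: int   # number of training steps in this phase
    e: float     # the parameter "e" of the text (assumed to be an update step size)
    s: float     # weighting of the second stage relative to the first


# Values quoted in the text for the 2-stage invariant encoder (M = 16, n = 3).
INVARIANT_SCHEDULE: List[Phase] = [
    Phase(500, 0.2, 5.0),
    Phase(500, 0.1, 10.0),
    Phase(500, 0.05, 20.0),
    Phase(500, 0.05, 40.0),
]


def iterate_schedule(schedule: List[Phase]) -> Iterator[Tuple[int, float, float]]:
    """Yield (global step, e, s) for every training step, phase by phase."""
    step = 0
    for phase in schedule:
        for _ in range(phase.steps):
            yield step, phase.e, phase.s
            step += 1
```

A training loop would consume `iterate_schedule(INVARIANT_SCHEDULE)` and apply one parameter update per yielded step, so that s increases from 5 to 40 over the 2000 steps.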

The result is shown in figure 12. During the early part of the training 
schedule the weighting of the second stage is still relatively small, so it has the 
effect of turning what would otherwise have been a joint encoder into a factorial 
encoder; this is analogous to the effect observed when figure 9 becomes figure 
10. However, as the training schedule progresses the weighting of the second 
stage increases further, and the reconstruction vectors self-organise so that each 
code index corresponds to a pair of objects with a well defined centroid but 
indeterminate separation. Thus each code index encodes only the centroid of 
the pair of objects and ignores their separation. This is a new type of encoder 
that arises when the objects are correlated, and it will be called an invariant 
encoder, in recognition of the fact that its output is invariant with respect to 
the separation of the objects. 

Note that in these results there is a large amount of overlap between the code 
cells, which should be taken into account when interpreting the illustration in 
figure 8. 

4 Conclusions 

The numerical results presented in this paper show that a stochastic vector 
quantiser (VQ) can be trained to find a variety of different ways of 
encoding high-dimensional input vectors. These input vectors are generated 
in two stages. Firstly, a low-dimensional manifold is created whose intrinsic 
coordinates are the positions of the objects in the scene; this corresponds to 
generating the scene itself. Secondly, this manifold is non-linearly embedded to 
create a curved manifold that lives in a high-dimensional space of image pixel 
values; this corresponds to imaging the generated scene. 

Three fundamentally different types of encoder have been demonstrated, 
which differ in the way that they build a reconstruction that approximates the 
input vector: 

1. A factorial encoder uses a reconstruction that is a superposition of a number 
of vectors that each lives in a well defined input subspace, which is useful 
for discovering constituent objects in the input vector. This result is a 
type of independent component analysis (ICA) jj. 

2. A joint encoder uses a reconstruction that is a single vector that lives in 
the whole input space. This result is basically the same as what would be 
obtained by using a standard VQ [7]. 

3. An invariant encoder uses a reconstruction that is a single vector that 
lives in a subspace of the whole input space, so it ignores some dimensions 
of the input vector, which is therefore useful for discovering correlated 
objects whilst rejecting uninteresting fluctuations in their relative coordi- 
nates. This is similar to self-organising transformation invariant detectors 
described in [12]. 

More generally, the encoder will be a hybrid of these basic types, depending 
on the interplay between the statistical properties of the input vector and the 
parameter settings of the SVQ. 

A Derivatives of the Objective Function 

In order to minimise D1 + D2 it is necessary to compute its derivatives. The 
derivatives were presented in detail in [9] for a single stage chain (i.e. a single 
FMC). The purpose of this appendix is to extend this derivation to a multi-stage 
chain of linked FMCs. In order to write the various expressions compactly, 
infinitesimal variations will be used throughout this appendix, so that δ(uv) = 
δu v + u δv will be written rather than ∂(uv)/∂θ = (∂u/∂θ) v + u (∂v/∂θ) (for some 
parameter θ). The calculation will be done in a top-down fashion, differentiating the 
objective function first, then differentiating anything that the objective function 
depends on, and so on following the dependencies down until only constants are 
left (this is essentially the chain rule of differentiation) . 

The derivative of D1 + D2 (defined in equation 10) is given by 

$$\delta(D_1 + D_2) = \sum_{l=1}^{L} s^{(l)} \left( \delta D_1^{(l)} + \delta D_2^{(l)} \right)$$


The derivatives of the D1 and D2 parts (defined in equation 7 with appropriate 
(l) superscripts added) of the contribution of stage l to D1 + D2 are 
given by (dropping the (l) superscripts again, for clarity) 

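$$\delta D_1 = \frac{2}{n} \int d\mathbf{x}\, \Pr(\mathbf{x}) \sum_{y=1}^{M} \left( \delta\Pr(y|\mathbf{x})\, \|\mathbf{x} - \mathbf{x}'(y)\|^2 + 2 \Pr(y|\mathbf{x})\, \left(\delta\mathbf{x} - \delta\mathbf{x}'(y)\right) \cdot \left(\mathbf{x} - \mathbf{x}'(y)\right) \right)$$

$$\delta D_2 = \frac{4(n-1)}{n} \int d\mathbf{x}\, \Pr(\mathbf{x}) \left( \delta\mathbf{x} - \sum_{y'=1}^{M} \left( \delta\Pr(y'|\mathbf{x})\, \mathbf{x}'(y') + \Pr(y'|\mathbf{x})\, \delta\mathbf{x}'(y') \right) \right) \cdot \left( \mathbf{x} - \sum_{y'=1}^{M} \Pr(y'|\mathbf{x})\, \mathbf{x}'(y') \right) \tag{15}$$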
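
The δx'(y) terms of these derivatives can be checked numerically. The sketch below is a minimal check, not the simulation code used in the paper. It assumes that D1 = (2/n) ∫dx Pr(x) Σ_y Pr(y|x) ||x - x'(y)||² and D2 = (2(n-1)/n) ∫dx Pr(x) ||x - Σ_y Pr(y|x) x'(y)||² (the definitions themselves lie outside this appendix), and it replaces the integral over Pr(x) by an average over a batch of samples. All function names are hypothetical.

```python
import numpy as np


def d1_d2(x, p, xr, n):
    """x: (N, d) inputs, p: (N, M) posteriors Pr(y|x), xr: (M, d) vectors x'(y)."""
    diff = x[:, None, :] - xr[None, :, :]                    # x - x'(y), shape (N, M, d)
    d1 = (2.0 / n) * np.mean(np.sum(p * np.sum(diff ** 2, axis=2), axis=1))
    resid = x - p @ xr                                       # x - sum_y Pr(y|x) x'(y)
    d2 = (2.0 * (n - 1) / n) * np.mean(np.sum(resid ** 2, axis=1))
    return d1, d2


def grad_xr(x, p, xr, n):
    """Gradient of D1 + D2 with respect to x'(y), read off from the dx'(y) terms."""
    num_samples = x.shape[0]
    diff = x[:, None, :] - xr[None, :, :]
    g1 = -(4.0 / n) * np.einsum("ny,nyd->yd", p, diff) / num_samples
    resid = x - p @ xr
    g2 = -(4.0 * (n - 1) / n) * (p.T @ resid) / num_samples
    return g1 + g2


def numeric_grad_xr(x, p, xr, n, h=1e-6):
    """Central finite-difference gradient of D1 + D2 with respect to x'(y)."""
    g = np.zeros_like(xr)
    for idx in np.ndindex(*xr.shape):
        up, dn = xr.copy(), xr.copy()
        up[idx] += h
        dn[idx] -= h
        g[idx] = (sum(d1_d2(x, p, up, n)) - sum(d1_d2(x, p, dn, n))) / (2 * h)
    return g
```

Comparing `grad_xr` with `numeric_grad_xr` on random inputs is a quick way to catch sign or prefactor errors before the same expressions are used in a gradient descent update.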

In numerical simulations the exact piecewise linear solution for the optimum 
Pr(y|x) (see section 1^75)) will not be sought, rather Pr(y|x) will be modelled 
using a simple parametric form, and then the parameters will be optimised. This 
model of Pr (y|x) will not in general include the ideal piecewise linear optimum 
solution, so using it amounts to replacing D1 + D2, which is an upper bound on 
the objective function D (see equation 10), by an even weaker upper bound on 
D. The justification for using this approach rests on the quality of the results 
that are obtained from the resulting numerical simulations (see section 3). 

The first step in modelling Pr(y|x) is to explicitly state the fact that it is a 
probability, which is a normalised quantity. This may be done as follows 

$$\Pr(y|\mathbf{x}) = \frac{Q(y|\mathbf{x})}{\sum_{y'=1}^{M} Q(y'|\mathbf{x})} \tag{16}$$

where Q(y|x) ≥ 0 (note that there is a slight change of notation compared with 
[3], because Q(y|x) rather than Q(x|y) is written, but the results are 
equivalent). The Q(y|x) are thus unnormalised probabilities, and Σ_{y'=1}^{M} Q(y'|x) is 
the normalisation factor. The derivative of Pr(y|x) is given by 



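$$\delta\Pr(y|\mathbf{x}) = \frac{1}{\sum_{y'=1}^{M} Q(y'|\mathbf{x})} \left( \delta Q(y|\mathbf{x}) - \Pr(y|\mathbf{x}) \sum_{y'=1}^{M} \delta Q(y'|\mathbf{x}) \right) \tag{17}$$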
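
The normalisation step and its derivative can be sketched for a single input vector as follows. This is an illustrative check rather than the paper's code: the function names are hypothetical, and `q` stands for the vector of unnormalised probabilities Q(y|x) over y = 1, ..., M.

```python
import numpy as np


def normalise(q):
    """Pr(y|x) = Q(y|x) / sum_y' Q(y'|x) for one input x; q has shape (M,)."""
    return q / np.sum(q)


def delta_pr(q, dq):
    """First-order change in Pr(y|x) caused by a small change dq in Q(y|x)."""
    total = np.sum(q)
    pr = q / total
    return (dq - pr * np.sum(dq)) / total
```

Because Pr(y|x) always sums to 1 over y, the entries of `delta_pr(q, dq)` sum to zero for any `dq`, which is a convenient sanity check.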



The second step in modelling Pr(y|x) is to introduce an explicit parametric 
form for Q(y|x). The following sigmoidal function will be used in this paper 

$$Q(y|\mathbf{x}) = \frac{1}{1 + \exp\left(-\mathbf{w}(y) \cdot \mathbf{x} - b(y)\right)} \tag{18}$$

where w (y) is a weight vector and b (y) is a bias. The derivative of Q (y|x) is 
given by 

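$$\delta Q(y|\mathbf{x}) = Q(y|\mathbf{x}) \left(1 - Q(y|\mathbf{x})\right) \left( \delta\mathbf{w}(y) \cdot \mathbf{x} + \mathbf{w}(y) \cdot \delta\mathbf{x} + \delta b(y) \right) \tag{19}$$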
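
The sigmoidal form of Q(y|x) and its derivative can be sketched in the same way. The array shapes and function names below are assumptions made for illustration: the weight vectors w(y) are stacked as the rows of `w`, and the biases b(y) as the entries of `b`.

```python
import numpy as np


def q_sigmoid(w, b, x):
    """Q(y|x) for all y at once; w: (M, d) weights, b: (M,) biases, x: (d,) input."""
    return 1.0 / (1.0 + np.exp(-(w @ x) - b))


def delta_q(w, b, x, dw, db, dx):
    """First-order change in Q(y|x) under small changes dw, db and dx."""
    q = q_sigmoid(w, b, x)
    return q * (1.0 - q) * (dw @ x + w @ dx + db)
```

In a single stage chain the input x is fixed data, so `dx` is zero and only the `dw` and `db` terms contribute.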

There are also δx derivatives in equation 15 and equation 19. The δx derivative 
arises only in multi-stage chains of FMCs, and because of the way in which 
stages of the chain are linked together (see equation [S]) it is equal to the 
derivative of the vector of probabilities output by the previous stage. Thus the δx 
derivative may be obtained by following its dependencies back through the stages 
of the chain until the first layer is reached; this is essentially the chain rule of 
differentiation. This ensures that for each stage the partial derivatives include 
the additional contributions that arise from forward propagation through later 
stages, as described in section [531 

There are also δx'(y) derivatives in equation 15, but these require no further 
simplification. 